Regression: predictive modelling – Part 1

ENVX2001 - Applied Statistical Methods

Januar Harianto

School of Life and Environmental Sciences

The University of Sydney

Mar 2024

Predictive modelling

“The best way to predict the future is to create it.”

– Peter Ferdinand Drucker, 1909–2005

Our workflow so far

Workflow

  1. Model development
    • Explore: visualise, summarise
    • Transform predictors: linearise, reduce skewness/leverage
    • Model: fit, check assumptions, interpret, transform. Repeat.
  2. Variable selection
    • VIF: remove predictors with high variance inflation factor
    • Model selection: stepwise selection, AIC, principle of parsimony, assumption checks
  3. Predictive modelling
    • Predict: Use the model to predict new data
    • Validate: Evaluate the model’s performance

Making predictions

Previously on ENVX2001…

We fitted a multiple linear regression model to the data (slide 62).

library(tidyverse)
multi_fit <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
summary(multi_fit)

Call:
lm(formula = log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.06193 -0.29970 -0.00231  0.30756  1.23578 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.2621323  0.5535669  -0.474 0.636798    
Temp         0.0491711  0.0060875   8.077 1.07e-12 ***
Solar.R      0.0025152  0.0005567   4.518 1.62e-05 ***
Wind        -0.0615625  0.0157130  -3.918 0.000158 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5086 on 107 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6644,    Adjusted R-squared:  0.655 
F-statistic: 70.62 on 3 and 107 DF,  p-value: < 2.2e-16

\widehat{log(Ozone)}=-0.262 + 0.0492 \cdot Temp + 0.00252 \cdot Solar.R - 0.0616 \cdot Wind

Predict: equation

\widehat{log(Ozone)}=-0.262 + 0.0492 \cdot \color{darkorchid}{Temp} + 0.00252 \cdot \color{darkorange}{Solar.R} - 0.0616 \cdot \color{seagreen}{Wind}

On a certain day, we measured (units are Imperial):

  • temperature Temp to be 80 degrees Fahrenheit
  • solar radiation Solar.R to be 145 units (Langleys)
  • wind speed Wind to be 10.9 miles per hour

What is the predicted ozone level?

\widehat{log(Ozone)}=-0.262 + 0.0492 \cdot \color{darkorchid}{80} + 0.00252 \cdot \color{darkorange}{145} - 0.0616 \cdot \color{seagreen}{10.9}

Easy! The two things we need to think about are…

  • What is the uncertainty in this prediction?
  • Can this model be used to predict ozone if we collect new data in the future?
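Plugging in the numbers (a quick sketch in base R, using the rounded coefficients from the equation above):

```r
# Point prediction on the log scale, using the rounded coefficients
log_ozone <- -0.262 + 0.0492 * 80 + 0.00252 * 145 - 0.0616 * 10.9
log_ozone       # about 3.37
exp(log_ozone)  # back-transform: about 29 ppb of ozone
```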

Uncertainty

  • Confidence interval: uncertainty in the mean response at a given predictor value.
  • Prediction interval: uncertainty in a single response at a given predictor value.


What it means

95% confidence interval: Given the parameters of the model, we are 95% confident that the mean response at a given predictor value is between y_1 and y_2.

95% prediction interval: Given the parameters of the model, we are 95% confident that a single response at a given predictor value is between y_1 and y_2.

Why the distinction?

  • Confidence interval: we are interested in the mean response.
  • Prediction interval: we are interested in a single value prediction.

Confidence interval (CI)

CI: standard error of the fit

se(\widehat{y}) = \sqrt{MSE \cdot \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)} where x_0 is the predictor value at which we want to predict the response

  • MSE is the mean squared error of the fit (residual ms)
  • \sum_{i=1}^n (x_i - \bar{x})^2 is the sum of squares of the predictor values
  • n is the number of observations
  • \bar{x} is the mean of the predictor values
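As a sanity check, this formula can be computed by hand and compared against what R reports; a sketch using the simple regression log(Ozone) ~ Temp, where a single predictor keeps \bar{x} and the sums straightforward:

```r
# se of the fitted mean response at Temp = 80, computed from the formula
d   <- na.omit(airquality[, c("Ozone", "Temp")])
fit <- lm(log(Ozone) ~ Temp, data = d)

x0   <- 80
n    <- nrow(d)
mse  <- sum(residuals(fit)^2) / df.residual(fit)   # residual mean square
sxx  <- sum((d$Temp - mean(d$Temp))^2)             # sum of squares of x
se_y <- sqrt(mse * (1 / n + (x0 - mean(d$Temp))^2 / sxx))

# predict() reports the same value as se.fit
predict(fit, newdata = data.frame(Temp = x0), se.fit = TRUE)$se.fit
```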

Prediction interval (PI)

PI: standard error of the prediction

se(\widehat{y}) = \sqrt{MSE \cdot \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)} where x_0 is the predictor value at which we want to predict the response

  • MSE is the mean squared error of the fit (residual ms)
  • \sum_{i=1}^n (x_i - \bar{x})^2 is the sum of squares of the predictor values
  • n is the number of observations
  • \bar{x} is the mean of the predictor values

Note

The only difference between the CI and PI formulas is the extra 1 added inside the square root for the PI. It is there because a single new response varies around the mean response with the full residual variance (MSE), whereas the estimated mean response itself does not.
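Equivalently, the extra 1 means the prediction variance is the fit variance plus the residual variance; a small sketch with the simple regression log(Ozone) ~ Temp:

```r
# Prediction se = sqrt(MSE + se.fit^2): the fit's uncertainty plus the
# residual scatter of a single new observation
d   <- na.omit(airquality[, c("Ozone", "Temp")])
fit <- lm(log(Ozone) ~ Temp, data = d)
sf  <- predict(fit, newdata = data.frame(Temp = 80), se.fit = TRUE)
mse <- sum(residuals(fit)^2) / df.residual(fit)

sqrt(mse + sf$se.fit^2)  # se of a single new response (always > sf$se.fit)
```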

Predictions in R

  • We can use the predict() function
  • First, we need to create a new data frame with the predictor values we want to predict at
to_predict <- data.frame(Temp = 80, Solar.R = 145, Wind = 10.9)
  • Then, we can use the predict() function to predict the response at these values
  • Use interval = "confidence" or interval = "prediction" to get the confidence or prediction interval.
predict(multi_fit, newdata = to_predict, interval = "confidence")
       fit      lwr      upr
1 3.365227 3.246265 3.484189
predict(multi_fit, newdata = to_predict, interval = "prediction")
       fit      lwr      upr
1 3.365227 2.350051 4.380404
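Both intervals above are on the log scale; exponentiating maps them back to ozone units. Note that the back-transformed interval is no longer symmetric about the prediction:

```r
# Back-transform the prediction interval to the original (ppb) scale
multi_fit  <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
to_predict <- data.frame(Temp = 80, Solar.R = 145, Wind = 10.9)

pred_ppb <- exp(predict(multi_fit, newdata = to_predict,
                        interval = "prediction"))
pred_ppb  # fit about 29 ppb; interval roughly 10 to 80 ppb
```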

Comparing CI vs PI

  • The confidence interval is narrower than the prediction interval.
  • It’s easier to visualise two-dimensional data, so let’s look at a simple linear regression model of log(Ozone) vs. Temp.
fit <- lm(log(Ozone) ~ Temp, data = airquality)
  • We create a range of predictions across all possible Temp values in 0.1 °F increments, and calculate both the CI and PI for each of those values
# Generate values to predict at in 0.1 degree increments
to_pred <- data.frame(Temp = seq(min(airquality$Temp), max(airquality$Temp), by = 0.1))
preds_ci <- predict(fit, newdata = to_pred, interval = "confidence") # confidence interval
preds_pi <- predict(fit, newdata = to_pred, interval = "prediction") # prediction interval
  • Extract only upper and lower CI and PI values and merge the data frames
pred_df <- data.frame(Temp = to_pred$Temp,
                      Lci = preds_ci[, "lwr"],
                      Uci = preds_ci[, "upr"],
                      Lpi = preds_pi[, "lwr"],
                      Upi = preds_pi[, "upr"])

Visualising CI vs PI

  • We can now plot the CI and PI as shaded areas around the predicted line
Code
p <-
  ggplot(airquality, aes(Temp, log(Ozone))) +
  geom_point() + 
  geom_line(data = pred_df, aes(Temp, Lci), color = "blue") +
  geom_line(data = pred_df, aes(Temp, Uci), color = "blue") +
  geom_line(data = pred_df, aes(Temp, Lpi), color = "red") +
  geom_line(data = pred_df, aes(Temp, Upi), color = "red") +
  labs(x = "Temperature (F)", y = "log(Ozone)") +
  theme_bw()
p

CI and geom_smooth()

  • Notice that geom_smooth() uses the CI, not the PI.
Code
p + geom_smooth(method = "lm", se = TRUE)

Limitations

All is good when we want to assess uncertainty in a model that we have already fit. But what if we want to know how well the model predicts new data, i.e. data that we did not use to fit the model?

What we need

  • A way to estimate how well the model predicts new data that hasn’t been used to fit the model, i.e. an independent dataset.
  • Because we have the actual values in the new dataset, we can compare them to the predicted values from the model.
    • If the model is good, we expect the predictions to be close to the actual values.
    • If the model is bad, we expect the predictions to be far from the actual values.

Model validation

General idea

  • We have a dataset that we use to fit a model, and want to assess how well the model predicts new data by performing model validation.
  • We can:
    • use the same dataset to assess how well the model fits the data (e.g. r^2); or
    • use a different dataset to assess how well the model predicts new data (e.g. RMSE).
  • The dataset can be obtained by:
    • Collecting new data.
    • Splitting the existing data into two parts before model building.
    • Cross-validation or k-fold cross-validation of existing data.

Collecting new data

[Diagram: the original dataset (used to fit the model) plus a newly collected dataset (used to test it)]



  • The best way to assess how well a model predicts new data is to collect new data.
    • Training set: used to fit the model.
    • Test set: used to assess how well the model predicts new data.

Pros

  • The new data is completely independent of the data used to fit the model.
  • Compared to splitting existing data, we have more data to fit the model and more data to validate the model.

Cons

  • It can be expensive and time-consuming to collect new data.
  • Some data may be impossible to collect (e.g. historical data).

Data splitting

[Diagram: the dataset split into a training subset and a test subset]



  • Split the existing data into two parts:
    • Training set: used to fit the model.
    • Test set: used to assess how well the model predicts new data.

Pros

  • The test set is completely independent of the training set.
  • Compared to collecting new data, it is cheaper and faster to split existing data.

Cons

  • We have less data to fit the model and less data to validate the model.
  • How do we split the data?
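One common answer is a random split, e.g. 70% training / 30% test; the proportion and the seed below are arbitrary choices for illustration:

```r
# Random 70/30 train/test split of the airquality data
set.seed(2024)                                   # make the split reproducible
aq    <- na.omit(airquality)                     # drop rows with missing values
idx   <- sample(nrow(aq), size = round(0.7 * nrow(aq)))
train <- aq[idx, ]                               # ~70% used to fit the model
test  <- aq[-idx, ]                              # ~30% held out for validation

fit  <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = train)
pred <- predict(fit, newdata = test)             # predictions on unseen data
```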

Cross-validation

[Diagram: repeated random training/test splits over iterations 1, 2, 3, …]


  • Like data splitting, where existing data is split into two parts:
    • Training set: used to fit the model.
    • Test set: used to assess how well the model predicts new data.
  • The difference is that the splitting is done multiple times, and the model is fit and validated multiple times.

Pros

  • Same as data splitting, but also:
    • The model is fit and validated multiple times, so we can get a better estimate of how well the model predicts new data.

Cons

  • We have less data to fit the model and less data to validate the model.
  • It can be computationally expensive to perform cross-validation.

k-fold cross-validation

[Diagram: 3-fold cross-validation, where each of folds 1-3 serves as the test set in one of iterations 1-3]


  • Split the data into k groups (folds)
  • Train on k − 1 folds, test on the remaining fold
  • Each fold is used for testing exactly once
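The procedure can be sketched in a few lines of base R (k = 5 is a common, but arbitrary, choice):

```r
# A minimal k-fold cross-validation loop in base R
set.seed(2024)
aq    <- na.omit(airquality)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(aq)))   # random fold labels

rmse <- numeric(k)
for (i in 1:k) {
  train   <- aq[folds != i, ]                      # fit on k - 1 folds
  test    <- aq[folds == i, ]                      # test on the held-out fold
  fit     <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = train)
  pred    <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((log(test$Ozone) - pred)^2))
}
mean(rmse)  # average out-of-sample RMSE across the k folds
```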

Pros

  • Same as cross-validation, but also:
    • Better use of all available data
    • Gives a more stable estimate of predictive performance, because the result is averaged over folds rather than depending on one particular way the data was split.

Cons

  • Computationally expensive, since the model must be refit once per fold
  • Bias in small datasets: each fold may contain too little data to provide a representative sample

Assessing prediction quality

Root-mean-square error (RMSE)

The most common metric for comparing the performance of regression models. RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}

Approximately, the standard deviation of the residuals.

  • A measure of accuracy for the model.
  • Unlike r^2, RMSE is not a relative measure, but is scaled to the units of y.
  • The smaller the RMSE, the better the model.
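RMSE is a one-liner in R; a sketch of a helper function, applied here to the in-sample fit of the multiple regression model from earlier:

```r
# Root-mean-square error: observed vs predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

aq  <- na.omit(airquality)
fit <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = aq)
rmse(log(aq$Ozone), fitted(fit))  # in units of log(Ozone); ~0.5 here
```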

Mean error (ME)

The average of the residuals.

ME = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)

Averaged difference between the predicted and observed values.

  • A measure of bias for the model.
  • Also scaled to the units of y.
  • Can be positive or negative to indicate over- or under-estimation.
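ME is equally short; a toy example to show the sign convention (observed minus predicted, so positive ME means the model under-predicts on average):

```r
# Mean error (bias): average of observed minus predicted
me <- function(obs, pred) mean(obs - pred)

me(c(10, 12, 14), c(9, 11, 13))  # 1: predictions are too low on average
```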

Lin’s concordance correlation coefficient (CCC)

A modification of Pearson’s correlation coefficient that takes into account the deviation of the observations from the identity line (i.e., the 45° line where the values of the two variables are equal).

CCC = \frac{2\text{Cov}(X,Y)}{\text{Var}(X) + \text{Var}(Y) + (\mu_X - \mu_Y)^2}

An “agreement” value that takes into account covariance, variances, and difference in means.

  • A measure of agreement (combining precision and accuracy) for the model.
  • Ranges from -1 to 1, with 1 indicating perfect agreement.
  • Unitless and scale-invariant.
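The formula translates directly to R. Note that this sketch uses R's sample variances and covariance (n − 1 denominators); some implementations follow Lin's original n denominators, so values can differ slightly in small samples:

```r
# Lin's concordance correlation coefficient from the formula
# (sample var/cov here; implementations using n denominators differ slightly)
ccc <- function(x, y) {
  2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
}

x <- seq(0, 100, 5)
ccc(x, x)       # perfect agreement: 1
ccc(x, x - 15)  # location shift: about 0.9, even though Pearson's r = 1
```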

CCC vs Pearson correlation coefficient

Code
library(tidyverse)

df <- tibble(y = seq(0, 100, 5),
  "45 degree line | CCC = 1" = seq(0, 100, 5)) %>%
  mutate("Location shift | CCC = 0.89" = `45 degree line | CCC = 1` - 15) %>%
  mutate("Scale shift | CCC = 0.52" = y / 2) %>%
  mutate("Location and scale shift | CCC = 0.67" = y * 2 - 20)

# pivot
df_long <- df %>%
  pivot_longer(-1, values_to = "x") %>%
  mutate(name = factor(name, 
    levels = c("45 degree line | CCC = 1",
      "Location shift | CCC = 0.89",
      "Scale shift | CCC = 0.52",
      "Location and scale shift | CCC = 0.67")))

ggplot(df_long, aes(x, y)) +
  geom_abline(intercept = 0, slope = 1, linewidth = 0.5, colour = "grey") +
  facet_wrap(~name) +
  geom_point() +
  xlim(0, 100) +
  labs(x = "", y = "") +
  theme_bw() +
  geom_blank() 

All of the above have a Pearson correlation coefficient of 1.

Next Lecture

We will go through several examples to practice data splitting, cross-validation, and model evaluation.

Thanks!

Questions? Comments?

Slides made with Quarto